Introduction to Machine Learning

HBIGS Core Course, June 23rd-24th, 2025

Author

Florian Huber
florian.huber@flohub.de
www.flohub.de

About this Course

General notes

  • Introduction to statistical learning = machine learning (ML).

  • No prior knowledge of ML is required, but some prior knowledge of R is assumed.

  • Mix of theory and exercises.

  • From 9:00-17:00 with a one-hour lunch break.

  • Please ask questions; I am always happy to receive feedback.

. . .

  • IMPORTANT NOTICE:
    • Many figures, text, formulas, and some of the exercises in this course are taken from the following book: An Introduction to Statistical Learning (James, Witten, Hastie, Tibshirani, 2nd edition).
    • There is also an online course based on this book.
    • The authors approve of this use (see here), provided proper citation is given, as is done here.

Course contents

  • Contents of this course:
    • Foundational ML concepts that are important no matter how modern a method is.
    • Supervised learning algorithms: k-nearest neighbours, logistic regression, decision trees (random forest).
    • Unsupervised learning: principal component analysis (PCA), k-means clustering, hierarchical clustering.
    • The bias-variance trade-off and the problem of overfitting.
    • Evaluating model performance: performance metrics.
    • Model tuning and strategies to avoid overfitting: train - validation - test set splits and resampling.
    • Not covered: generative models, neural networks (deep learning). Deep learning is mainly useful if you have a lot of unstructured data (images, text, audio, …).

. . .

  • ML is a fast-moving field so I designed this course in a way that you will learn the fundamental concepts that reoccur in all fields of ML. The algorithms you will learn are very useful even in the “age of deep learning”.

  • More study resources:

Machine Learning: an Overview

What is machine learning?

  • “Machine learning (ML) is a method of teaching computers to make predictions based on some data.”
  • In supervised learning you train an algorithm to predict pre-defined data labels from your features. This results in a model.
  • Example: predicting cell types from gene expression data; predicting compound activity from chemical fingerprints; predicting house prices from location and building data, …
  • In unsupervised learning we use algorithms to find structure in complex data sets. Example: clustering gene expression data (heatmaps).
  • Generative models learn how to generate new data based on old data (images, molecules etc.). Not covered in this course. Most recent example: Large language models (LLMs) such as ChatGPT, image, video and audio generation (e.g. Midjourney).

. . .

Applications of ML in biology and medicine

  • Predicting protein structure (AlphaFold).
  • Predicting compound activity, drug mode of action, cell types, …
  • Finding regulatory sequences in DNA.
  • Segmenting images, classifying objects.
  • Classification of radiology images.
  • Clustering of “omics” data.

. . .

Many of the ideas in ML are not that new.

However, only since the advent of cheap, powerful computers (GPUs!) and increasing amounts of data have they lived up to their full potential (scale matters!).

Machine learning and data science

Some notes on the theory of ML

Regression vs. classification

  • Predicting quantitative responses = regression problem.
  • Predicting qualitative responses = classification problem.
  • Some methods estimate class probabilities and can therefore be thought of as both classification and regression methods.
  • Note that this refers to the response variable = label = outcome. Whether the predictors are qualitative or quantitative is usually less important.

Estimating \(f\)

  • We observe a response \(Y\) and p different predictors \(X = (X_1,~X_2,~\ldots~,~X_p)\).
  • We assume that there is a relationship between \(Y\) and \(X\) which can be formally written as: \[Y = f(X) + \epsilon\]
  • \(f(X)\) is a fixed but unknown function of \(X\).
  • \(\epsilon\) (epsilon) is a random error term which is independent of \(X\) and has mean zero.
  • \(Y\) is also often called the output/dependent/target variable or the label.
  • \(X\) is also often called features or input/independent variables.
  • \(f(X)\) represents the systematic information that \(X\) provides about \(Y\).
  • Machine learning is about learning a function \(f(X)\) (=model) that is as accurate as possible (the model has low bias) and that generalises well to new/unknown data = “test data” (we say the model has low variance).


Estimating \(f\)

  • Since \(f\) is not known, it must be estimated from the data. Estimates in statistics are denoted with a caret ^:

\[ \hat{Y}=\hat{f}(X) \]

  • \(f\) can in principle take any shape. It can be a straight line, as in a linear regression, or a plane, or a high-dimensional shape with no intuitive representation.
  • When predicting an outcome, there are two sources of error:
    • Reducible error, which results from the fact that \(\hat{f}(X) \neq f(X)\).
    • Irreducible error, symbolised by \(\epsilon\). This may result from unmeasured variables or unmeasurable variation.
  • A very common metric to assess the quality of a ML algorithm is the mean squared error (MSE). Think of linear regression: the lower the MSE the better your fit.
  • Mathematically, the expected value of the squared difference between the predicted and actual value of Y is given as:

\[ E(Y-\hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X)-\hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{Var(\epsilon)}_{\text{Irreducible}} \]
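This decomposition can be checked with a small simulation. The following is a minimal sketch in plain Python; the true function \(f\), the noise level, and the deliberately imperfect estimate \(\hat{f}\) are all invented for illustration:

```python
import random

random.seed(42)

def f(x):            # the true (normally unknown) function
    return 2 * x + 1

def f_hat(x):        # a deliberately imperfect estimate of f
    return 2 * x + 1.5

# Simulate Y = f(X) + epsilon with Var(epsilon) = 1
n = 100_000
squared_errors = []
for _ in range(n):
    x = random.uniform(0, 1)
    y = f(x) + random.gauss(0, 1)
    squared_errors.append((y - f_hat(x)) ** 2)

mse = sum(squared_errors) / n
reducible = (1.5 - 1) ** 2   # [f(X) - f_hat(X)]^2 = 0.25 for every x here
irreducible = 1.0            # Var(epsilon)
# mse comes out close to reducible + irreducible = 1.25
```

No matter how good \(\hat{f}\) becomes, the simulated MSE can never drop below the irreducible part, \(Var(\epsilon)\).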

Prediction vs. inference

There are two reasons why we may want to estimate \(f\):

  1. Prediction:
    • We may be mainly interested in the predictive performance of a model, i.e. we do not care about the exact form of \(\hat{f}\) as long as the estimate \(\hat{Y}\) of \(Y\) is as accurate as possible.
    • Understanding how the prediction comes about is not so important, we are happy treating the model as a black box.
    • Example: deep learning, complex decision tree algorithms, generative models.
    • Think AI = “alien intelligence”.

. . .

  2. Inference:
    • Means that we want to understand the association between \(Y\) and \(X_1, …, X_p\).
    • For example:
      • Which predictors are associated with the response?
      • What is their contribution?
      • Which ones are the most important ones?
      • Is the relationship between the predictors and the response linear or does it have a more complicated shape?
    • You know this from linear regression, where it is very clear how much each predictor contributes to the output.

Our first ML algorithms

K-nearest neighbours (KNN): our first ML algorithm

  • There are many different ML algorithms, all with their pros and cons.
  • We will start by learning a relatively simple, yet often useful, algorithm called k-nearest neighbours (KNN).
  • K-nearest neighbours (KNN) is an algorithm that can be used for both classification and regression.
  • This will serve to illustrate, among others, the concept of the bias-variance trade-off which is related to overfitting - two fundamental machine learning concepts.

KNN algorithm: scheme

KNN classifier with K = 3.

KNN regression with K = 1 (left) or K = 9 (right).

K-nearest neighbours (KNN)

  • More formally, the algorithm works as follows:
    • Define a number K.
    • For each new observation \(x_0\) for which a prediction is desired, identify the K nearest neighbours \(N_0\) in the training set and then assign to \(x_0\) the class which is most common among those K neighbours. Mathematically, class probabilities are calculated as follows: \[ P(Y=j|X=x_0) = \frac{1}{K}\sum_{i \in N_0}I(y_i=j). \]
  • KNN regression works similarly but instead of class probabilities the average values of the target variable \(Y\) of the training observations are used.
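As a sketch of the idea (not the implementation used in the exercises), a minimal KNN classifier fits in a few lines of Python; the toy training points below are invented for illustration:

```python
from collections import Counter
import math

def knn_predict(x0, training_data, k):
    """Classify x0 by majority vote among its k nearest training points.

    training_data is a list of (feature_vector, class_label) pairs.
    """
    # Sort training points by Euclidean distance to x0, keep the k nearest
    neighbours = sorted(
        training_data,
        key=lambda pair: math.dist(x0, pair[0]),
    )[:k]
    # Class "probabilities" are the fractions of the k neighbours per class;
    # the predicted class is the most common one
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy 2-D data set with two well-separated classes
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
knn_predict((0.5, 0.5), train, k=3)   # -> "A"
knn_predict((5.5, 5.5), train, k=3)   # -> "B"
```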

Logistic regression: a simple yet powerful classification algorithm

  • You are probably familiar with linear regression: \(y = \beta_0 + \beta_1x\).
  • Note that we can have multiple variables \(x_1, \ldots, x_p\) (“multiple linear regression”) and that the predictors can be qualitative.
  • Linear regression can be modified so that the response variable y is qualitative, i.e. it can be turned into a classifier.
  • This is called logistic regression and is achieved with a couple of “tricks”.

Logistic regression: derivation I

  • Setting: we are going to work with the “Default” data set: it indicates whether a customer defaulted on their credit card debt (variable default with values ‘Yes’ and ‘No’) using the predictors student, balance, and income.
  • Naive option: run a linear regression using student, balance and income as independent variables and predict the outcome as the probability of default, i.e. something like P(default = Yes|student, balance, income).
  • However, with the standard linear regression model we will get probabilities <0 and >1. Therefore, a different formula is needed, for example the logistic function:

\[ p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}} \]

  • This can be rewritten as:

\[ \frac{p(X)}{1-p(X)} = e^{\beta_0 + \beta_1X} \]

  • And by taking the logarithm we arrive at:

\[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X \]

Logistic regression: derivation II

\[ \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X \]

  • The left-hand side is the logarithm of the odds, also called log odds or logit.
  • We can see that the log odds are linear in \(X\). A one-unit increase in \(X\) will lead to a \(\beta_1\) increase in the log odds.
  • Model fitting is not done via least squares but with an approach called maximum likelihood.
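The logistic function and the log odds can be written down directly from the formulas above. The coefficient values in this sketch are arbitrary illustrative numbers, not coefficients fitted on the Default data:

```python
import math

def logistic(x, beta0, beta1):
    """p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)); always between 0 and 1."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1 + math.exp(z))

def log_odds(p):
    """logit(p) = log(p / (1 - p)); linear in x for a logistic model."""
    return math.log(p / (1 - p))

# Probabilities stay inside (0, 1), unlike plain linear regression ...
p = logistic(3.0, beta0=-2.0, beta1=1.5)
# ... and the log odds recover the linear predictor: -2.0 + 1.5 * 3.0 = 2.5
log_odds(p)   # ≈ 2.5
```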

Exercise 1

Evaluating the performance of a machine learning model

Basic overview

  • In a regression setting usually some variant of the mean squared error (MSE) is used. Think of linear regression.
  • In a classification setting, there are a lot more options. The metric you want to use will often depend on your particular problem.

Classifier performance: the confusion matrix

Classifier performance: accuracy and sensitivity

  • What follows is an overview of the most common performance metrics used in a classification setting.
  • Accuracy is the fraction of correct predictions: \(\frac{TP + TN}{N}\).
  • The sensitivity or true positive rate (TPR, also called recall) is the probability of classifying an observation as ‘positive’, conditioned on it being truly positive: \(\frac{TP}{TP+FN}\).
  • Sensitivity answers the question: “Which fraction of positive items (‘hits’) do I find?”

Classifier performance: false positive rate and specificity

  • The false positive rate (FPR = “false alarm rate”) is the probability of classifying an observation as positive which is in fact negative: \(\frac{FP}{FP + TN}\).
  • Underlying question: “How many negative items are incorrectly classified as positive”?
  • The specificity or true negative rate (TNR) is the probability of classifying an observation as ‘negative’, conditioned on it being truly negative: \(\frac{TN}{TN+FP}\).
  • Question: “Which fraction of negative items do I correctly identify?”

Classifier performance: Precision

  • Precision (or positive predictive value (PPV)) is defined as \(\frac{TP}{TP + FP}\).
  • Question: “When I say that an observation is positive, how often am I right?”

  • You can see, there are a lot of possible metrics that can be used to assess the quality of a predictive model. Often, what the best model or the best cutoff is will depend on the problem at hand and some sort of cost associated with a wrong decision. For example, when screening for cancer you don’t want to miss someone who has cancer, i.e. have a high sensitivity. At the same time, being overly cautious might place an unacceptable burden on the healthcare system and healthy patients.
  • Don’t worry if it takes you some time to get familiar with the terminology and concepts behind the different ways of assessing performance. It is a field that is easy to grasp but hard to master in all its subtleties. This Wikipedia article is a good reference.
  • However, the above metrics are frequently summarized using so-called ROC curves and precision-recall curves.
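As a small illustration, all of the metrics above follow directly from the four confusion-matrix counts; the counts in this sketch are invented:

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the common performance metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "accuracy":    (tp + tn) / n,
        "sensitivity": tp / (tp + fn),   # true positive rate / recall
        "specificity": tn / (tn + fp),   # true negative rate
        "fpr":         fp / (fp + tn),   # false positive rate = 1 - specificity
        "precision":   tp / (tp + fp),   # positive predictive value
    }

m = classification_metrics(tp=80, fp=20, tn=880, fn=20)
# e.g. m["sensitivity"] == 0.8 and m["precision"] == 0.8,
# while m["accuracy"] == 0.96 because the negatives dominate
```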

Receiver operating characteristic (ROC) curves

  • The performance metrics presented above typically depend on a threshold, since in many cases models output probabilities. For example, in the previous exercise using logistic regression: at which probability does one call a prediction ‘positive’?
  • To understand how the threshold affects performance, several types of plots are commonly used.
  • A popular choice is the “receiver operating characteristic (ROC) curve”.
  • To construct a ROC curve, the threshold at which we call an observation ‘positive’ is varied and the true positive rate is plotted against the false positive rate (see next slide).
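This threshold sweep can be sketched in a few lines of Python; the predicted probabilities and labels below are invented for illustration:

```python
def roc_points(probs, labels):
    """Sweep the decision threshold and collect (FPR, TPR) pairs."""
    pos = sum(labels)            # number of truly positive observations
    neg = len(labels) - pos      # number of truly negative observations
    points = []
    for threshold in sorted(set(probs), reverse=True):
        pred = [p >= threshold for p in probs]
        tp = sum(1 for y_hat, y in zip(pred, labels) if y_hat and y == 1)
        fp = sum(1 for y_hat, y in zip(pred, labels) if y_hat and y == 0)
        points.append((fp / neg, tp / pos))   # (FPR, TPR) at this threshold
    return points

probs  = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]
pts = roc_points(probs, labels)
# as the threshold is lowered, the curve moves from near (0, 0) to (1, 1)
```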

ROC curves

This figure illustrates how altering the threshold for calling an observation ‘positive’ influences the TPR and the FPR.


The trade-off between TPR and FPR

  • There is an inherent trade-off between TPR and FPR.
  • In order to “catch” as many positive items as possible (= high TPR) the threshold for calling something positive must be low, which will automatically increase the chances of wrongly classifying a negative observation as positive.
  • The higher the area under the ROC curve (AUC) is, the better. A diagonal line indicates a random classifier, its AUC is 0.5.

Precision-recall curves

  • Another useful quality measure is the precision-recall curve.
  • Here, precision is plotted on the y-axis and recall on the x-axis.

Exercise 2

Hyperparameter tuning and
resampling techniques

Hyperparameters

  • Random forests, for example, have the following hyperparameters:
    • The number of trees \(ntree\).
    • The number of features \(m\) to consider at each split in RF.
  • Hyperparameters cannot be learned from the data.
  • The person training an algorithm has to choose them.
  • Choosing the optimal hyperparameters is done by model tuning techniques and can improve model performance considerably.

Tuning hyperparameters: general idea

  • Pick a number of values or value combinations for the hyperparameter(s).
  • Train the model on the training set.
  • Evaluate the model performance on a separate validation set.
  • Pick the best model (so-called model selection step) and retrain it on the complete data set (training + validation combined).
  • There are a few things to consider (coming up)!

How to choose hyperparameter values

  • There are many options.
  • Two popular choices:
    • Randomly pick values = random search.
    • Systematically search values = grid search.
  • Interestingly, random search often yields results similar to or better than grid search at a much lower computational cost:

Model evaluation

  • Think back to the KNN exercise: evaluating a model using the training data is a bad idea. We risk overfitting.
  • Model evaluation during tuning has to be done using the validation/test set, not the training set.
  • How to choose a validation/test set? There are different resampling techniques.

Resampling techniques: the hold-out set approach

  • Most basic resampling technique.
  • Especially convenient if one has enough data.
  • Data is randomly split into a training set and a hold-out set, also called a validation set.
  • The model is trained on the training set and then evaluated on the validation set.
  • The model with the best validation error is selected.
  • In this way, hyperparameters can be tuned, for example the parameter K in the KNN algorithm.

A schematic of the validation set approach.

Resampling techniques: the cross-validation approach

  • One problem with the hold-out set approach is that ML models usually perform better when trained on more data.
  • At the same time, not all samples are used to estimate the validation error. This may bias the estimates.
  • This is especially a problem if one does not have a large data set available (common in biology).
  • A solution is to divide the data into \(k\) groups or folds of approximately equal size. Next, the model is trained on all but one fold and the validation error is calculated on the left-out fold.
  • This process is repeated k times until each fold has been used as a validation set exactly once.
  • The (k-fold) cross-validation error is the average of the errors on each fold:

\[ CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k}MSE_i. \]
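A minimal sketch of the k-fold procedure in plain Python. The “model” here is a trivial predict-the-training-mean regressor, chosen only to keep the example self-contained; any real model would be trained and evaluated the same way:

```python
import statistics

def k_fold_cv(ys, k):
    """Average the MSE over k folds; each fold serves as validation set once."""
    n = len(ys)
    fold_size = n // k
    fold_mses = []
    for i in range(k):
        val = range(i * fold_size, (i + 1) * fold_size)   # left-out fold
        train_ys = [ys[j] for j in range(n) if j not in val]
        prediction = statistics.mean(train_ys)            # toy "model"
        mse = statistics.mean((ys[j] - prediction) ** 2 for j in val)
        fold_mses.append(mse)
    # k-fold CV error = average of the k fold MSEs
    return statistics.mean(fold_mses)

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_error = k_fold_cv(ys, k=3)   # here: (9.25 + 0.25 + 9.25) / 3 = 6.25
```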

Cross-validation: scheme

  • Here is a scheme for a 5-fold cross-validation (CV):

Exercise 3

Model selection vs. model evaluation part I

  • There are two reasons why we may need a data set separate from the training set:
    1. We already know our hyperparameters and want to know how the model will perform on new data. This is known as model evaluation. The purpose is to estimate the generalisation error.

    2. We want to know which set of hyperparameters is best. This is known as model selection.

  • In both cases a hold-out set or cross-validation can be used.
  • Important: a given hold-out set can be used to either tune an algorithm (i.e. finding the best hyperparameters) or evaluate its performance (if the hyperparameters are already known/fixed) but not both.
  • This means: when you perform model selection (hyperparameter tuning) your best model will have a certain error rate.
  • You should not report this error rate as the error rate (performance) of your model. The estimate will be overly optimistic.
  • In fact, you need yet another hold-out set to estimate the generalisation error.
  • The following slide summarises possible scenarios and what you should do.

Model selection vs. model evaluation part II

  • This is a confusing topic that is often poorly explained.
  • Therefore, let’s summarise the 3 typical scenarios and what you should do:
  • Scenario 1: you already know your hyperparameters and only want to report expected model performance
    • Split your data into 2 data sets: training and test set.
    • Train on the training set and predict on the test set. The test set performance = your expected model performance on new data (generalisation error).
    • Instead of splitting into 2 data sets you can use cross-validation.
  • Scenario 2: you only want to tune the hyperparameters
    • Like in scenario 1 but you will end up comparing lots of models with different hyperparameters and pick the best one.
    • The error of that model will be too optimistic to be a good estimate of the generalisation error. However, if you did not try too many combinations it is still an OK proxy.
    • Instead of splitting into 2 data sets you can use cross-validation.
  • Scenario 3: you want to tune hyperparameters and report expected model performance
    • Split your data into 3 data sets: training, validation, and test set.
    • Tune your model by training on the training set and evaluating on the validation set (see scenario 2).
    • Retrain using the best hyperparameters on the training + the validation set.
    • Obtain an estimate of the generalisation error by reporting the performance on the test set (cf. scenario 1).
    • If you want to use cross-validation here you will need to do nested cross-validation.
  • In all cases: Before predicting new data train your model again on the complete data set (training + validation + test).

A note on terminology

  • In this course I try to use the term “validation set” for tuning hyperparameters and “test set” to get the model performance (generalisation error).
  • Unfortunately, the terms “validation set” and “test set” are not used consistently in the manner described here. Often the test set is actually a validation set. Some people call the “validation set” the “development (dev) set”.
  • Moreover, people sometimes report the error from model tuning as the performance of the model, which is problematic.
  • In addition, it can take some time to find resources that discuss these things thoroughly yet understandably. You may find this blog post very helpful.

A full machine learning workflow

  • Putting these pieces together we can now define a full machine learning workflow:
    1. Model tuning and selection: Compare different algorithms and hyperparameter combinations by training each combination on a training set and evaluating it on a validation set using either the hold-out set or a CV approach. The model with the lowest validation error “wins”.

    2. Model evaluation: Take the best model selected in (1), train it on both the training and the validation set. Then evaluate the model by running predictions on the test set.

    3. Optional: Retrain the best model from (1) on all the data that you have. If you have many data points this is probably not necessary.

  • See also this paper and this blog post by Sebastian Raschka for an in-depth discussion.
  • The next slide is taken from his website and illustrates this procedure.

Decision trees: a family of powerful ML algorithms

The idea

  • Decision trees are a family of related machine learning algorithms that perform particularly well on tabular data.
  • In fact, in most scenarios decision trees still outperform state-of-the-art neural-network-based algorithms on tabular data while being much easier to tune.
  • The basic principle of a decision tree is that it stratifies or segments observations into different regions of a predictor space by learning splitting rules.
  • For example, in the Hitters data set the aim is to predict the salary of baseball players based on the number of years played and the number of hits in the previous year.
  • In the case of decision trees, the regions into which the predictor space is divided have rectangular shapes (“boxes”).

Regression and classification trees

  • Decision trees can be used both for regression and classification.
  • In the case of a regression tree the goal is to find regions that minimize the residual sum of squares (RSS), given by

\[ \sum_{j=1}^{J}{\sum_{i \in R_j}(y_i - \hat{y}_{R_j})^2} \]

  • where J is the number of boxes and \(\hat{y}_{R_j}\) is the mean response for the training observations within the \(j\)th box.
  • For classification, an observation will be predicted to be of the class that is most common in the region to which it belongs.
  • To find the best box boundaries (cutoffs) one could use the classification error rate (the fraction of observations that do not belong to the most commonly occurring class) but instead the Gini index is normally used:

\[ G = \sum_{k=1}^{K}\hat{p}_{mk}(1-\hat{p}_{mk}). \]

  • \(\hat{p}_{mk}\) represents the proportion of training observations in the \(m\)th region that are from the \(k\)th class. G becomes small if a node contains predominantly observations from a single class, i.e. if most of the \(\hat{p}_{mk}\) are either 0 or 1.
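Computing \(G\) from the class proportions in a node is a one-liner; the example proportions below are invented:

```python
def gini(proportions):
    """G = sum over classes of p * (1 - p); small for 'pure' nodes."""
    return sum(p * (1 - p) for p in proportions)

gini([1.0, 0.0])   # 0.0  -> perfectly pure node
gini([0.5, 0.5])   # 0.5  -> maximally mixed two-class node
gini([0.9, 0.1])   # 0.18 -> nearly pure, hence a small Gini index
```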

Recursive binary splitting

  • It is not possible to consider every possible combination of boxes.
  • Therefore, a top-down, greedy approach known as recursive binary splitting is used: we begin at the top of the tree and look for the feature whose split at a certain cutoff leads to the greatest reduction in RSS or the lowest Gini index. Then the process is repeated recursively for each subregion of the predictor space.

  • Decision trees constructed in the way described before often overfit the training data (they suffer from high variance) and do not have the same predictive performance as other ML methods.
  • However, two famous tree-building algorithms solve this problem using different techniques: XGBoost and random forests.

The Random Forests Algorithm I

  • Random forests is an algorithm that uses a couple of “tricks” that enhance the predictive power of decision trees.
  • When to use:
    • Random forests can be applied to virtually any tabular data set. They even work with hundreds or thousands of features.
    • For lower dimensional data (p < 10) regression approaches or KNN may work better and are more interpretable.
  • Disadvantage: random forest models are not easy to interpret; they are good at predicting but not the best choice for inference.

The Random Forests Algorithm II

  • A random forest (RF) is not one but a collection of several hundred decision trees that together result in a prediction.

  • At training time:

    • Each tree is trained not on the complete data set but only on a bootstrapped sample of the training set (see the following slides). Therefore, for each tree, about 1/3 of the samples are not considered in the training process. These are the so-called out-of-bag (OOB) observations.
    • At each split, only a fraction \(m\) of the predictors is considered (typically \(m\approx\sqrt{p}\)).
    • Training each tree on a bootstrapped sample and aggregating the trees’ predictions is called bootstrap aggregation or bagging. During training, predictions can be evaluated on the out-of-bag (OOB) observations.
  • At prediction time:

    • Predictions are made by a majority vote: each tree results in a prediction. The majority of predictions wins.
  • ML methods that use not one but many models are called ensemble methods.

  • Motivation

    • Bagging together with only considering a subset \(m\) of predictors results in an ensemble of decorrelated trees, the resulting classifier has low overall variance.
  • In fact, the out-of-bag error is considered a valid estimate of the test error of the bagged model.

RF diagram

Bootstrapping

  • Bootstrapping is a resampling technique that takes a sample and constructs a new sample by sampling with replacement from the original data set.
  • It is used to create training data sets such as in RF but can also be used to estimate statistical parameters.
  • It can be mathematically shown that each bootstrapped sample contains on average 2/3 of the original observations.
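This claim is easy to verify empirically: the probability that a given observation appears in a bootstrap sample of size \(n\) is \(1-(1-1/n)^n \approx 1 - 1/e \approx 0.632\). A quick simulation (sample size and number of trials chosen arbitrarily):

```python
import random

random.seed(0)
n = 1000         # number of observations in the original data set
trials = 200     # number of bootstrap samples to draw
fractions = []
for _ in range(trials):
    # Draw n observations with replacement from the indices 0..n-1
    boot = [random.randrange(n) for _ in range(n)]
    # Fraction of distinct original observations in the bootstrap sample
    fractions.append(len(set(boot)) / n)

avg = sum(fractions) / trials
# avg comes out close to 1 - 1/e ≈ 0.632, i.e. roughly 2/3
```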

A graphical illustration of the bootstrap approach with 3 observations.

Exercise 4

Unsupervised learning

What is unsupervised learning?

  • Unsupervised learning refers to a type of machine learning where algorithms are trained on data without explicit labels.
  • The primary goal is to identify underlying patterns or groupings in the data, such as clustering similar data points together or reducing the dimensionality of the data.
  • Common algorithms include K-means clustering and hierarchical clustering (grouping similar data points) and Principal Component Analysis (PCA, for dimensionality reduction).
  • It is often used in exploratory data analysis, feature engineering, and scenarios where obtaining labeled data is expensive or infeasible.

Principal component analysis (PCA): motivation

  • PCA is not a clustering but a dimensionality reduction technique.
  • PCA is often used as a preparatory step for clustering.
  • Especially popular in biology because biological data are often high-dimensional, e.g. gene expression data.
  • Consider a high-dimensional data set: impossible to find patterns using hundreds of pair plots.
  • PCA allows capturing the variation in the data set in fewer dimensions: the “principal components”.
  • Changes that only affect some of the observations may be missed.

PCA: basic principle

  • PCA transforms one coordinate system into another.
  • The new coordinates are called the principal components (PCs).
  • The PCs are calculated such that the first PC (PC1) has the highest variance, PC2 the second highest variance etc.
  • PCs are orthogonal to each other.
  • A nice animated representation of the process can be found here.

PCA details

  • Make sure to center and scale your data before running PCA!
  • The principal components are calculated by multiplying the principal component loading vectors \(\Phi_{1}, \Phi_{2}, \ldots, \Phi_{n}\) with the original coordinates (features).
  • The values of the PC loading vectors are called principal component loadings.
  • The resulting values in the PCs are called principal component scores.
  • Example: to calculate the first PC score of sample \(i\) in a data set with p features:

\[ z_{i1} = \Phi_{11}x_{i1} + \Phi_{21}x_{i2} + \ldots + \Phi_{p1}x_{ip} \]

  • \(\Phi_{11}, \ldots, \Phi_{p1}\) are the loadings of the first principal component loading vector.
  • To calculate the second PC score of sample \(i\):

\[ z_{i2} = \Phi_{12}x_{i1} + \Phi_{22}x_{i2} + \ldots + \Phi_{p2}x_{ip} \]
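Computing a PC score is just a dot product of a sample’s (centred and scaled) feature values with a loading vector. The loading vector below is a made-up example for illustration, not one obtained from a real PCA:

```python
def pc_score(x, loading):
    """z = phi_1*x_1 + phi_2*x_2 + ... + phi_p*x_p for one sample."""
    return sum(phi * xi for phi, xi in zip(loading, x))

# One (already centred and scaled) sample with p = 3 features
x = [0.5, -1.2, 0.7]
# Hypothetical first PC loading vector (unit length, as is conventional)
phi1 = [0.6, 0.0, 0.8]
pc_score(x, phi1)   # 0.5*0.6 + (-1.2)*0.0 + 0.7*0.8 = 0.86
```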

Exercise 5

K-means clustering

  • Partitions observations into \(K\) distinct, non-overlapping clusters.
  • K has to be specified beforehand.
  • The algorithm tries to find a partitioning that minimizes the within-cluster variation.
  • The within-cluster variation can be measured in many ways. Often, the squared Euclidean distance is used.
  • There are \(K^n\) ways to partition \(n\) observations into \(K\) clusters. Therefore, a heuristic is used that finds a local optimum:
    • Randomly assign a number from 1 to \(K\) to each of the observations. These serve as initial cluster assignments for the observations.
    • Iterate until the cluster assignments stop changing:
      1. For each of the \(K\) clusters, compute the cluster centroid. The \(k\)th cluster centroid is the vector of the \(p\) feature means for the observations in the \(k\)th cluster.
      2. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
  • To make sure the best clustering is found the algorithm should be repeated several times.
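The steps above can be sketched directly in Python. This is a minimal 1-D sketch with invented toy data; real implementations (e.g. R’s kmeans()) are more general and efficient:

```python
import random
import statistics

def k_means_1d(xs, k, seed=0):
    """Minimal K-means on 1-D data, following the steps above."""
    rng = random.Random(seed)
    # Step 1: randomly assign each observation to one of the k clusters
    assignment = [rng.randrange(k) for _ in xs]
    while True:
        # Step 2a: compute each cluster's centroid (mean of its members)
        centroids = []
        for j in range(k):
            members = [x for x, a in zip(xs, assignment) if a == j]
            # fall back to a random point if a cluster happens to be empty
            centroids.append(statistics.mean(members) if members else rng.choice(xs))
        # Step 2b: reassign each observation to the closest centroid
        new_assignment = [min(range(k), key=lambda j: abs(x - centroids[j]))
                          for x in xs]
        if new_assignment == assignment:   # stop when assignments are stable
            return new_assignment, centroids
        assignment = new_assignment

xs = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
labels, centroids = k_means_1d(xs, k=2)
# the three small and the three large values end up in separate clusters
```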

Exercise 6

Hierarchical clustering

  • Disadvantage of K-means clustering: number of clusters K needs to be specified in advance.

  • Hierarchical clustering is a different way of clustering that results in a dendrogram.

  • The algorithm starts by treating each observation as a separate “cluster” and then works its way up:

    1. Begin with \(n\) observations and a measure (such as Euclidean distance) of all the \(\binom{n}{2}=n(n-1)/2\) pairwise dissimilarities. Treat each observation as its own cluster.
    2. For \(i=n, n-1, \ldots, 2\):
      1. Examine all pairwise inter-cluster dissimilarities among the \(i\) clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
      2. Compute the new pairwise inter-cluster dissimilarities among the \(i − 1\) remaining clusters.
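A minimal sketch of this bottom-up procedure, using complete linkage (the largest pairwise distance between two clusters; linkage types are discussed on the next slide) and returning only the fusion heights. The 2-D points are invented for illustration:

```python
import math

def hierarchical_cluster(points, linkage="complete"):
    """Agglomerative clustering; returns the n-1 fusion heights in order."""
    clusters = [[p] for p in points]   # each observation starts as its own cluster
    heights = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest dissimilarity
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [math.dist(a, b) for a in clusters[i] for b in clusters[j]]
                d = max(dists) if linkage == "complete" else min(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)                          # dendrogram height of this fusion
        clusters[i] = clusters[i] + clusters[j]    # fuse the two clusters ...
        del clusters[j]                            # ... and drop the absorbed one
    return heights

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
hierarchical_cluster(pts)
# fuses the two close pairs first (height 1.0 each), then one big final merge
```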

Linkage types

To be able to compute inter-cluster dissimilarities we introduce linkage, a concept to describe the dissimilarity between clusters. The linkage type has a large impact on the resulting dendrogram.

Illustration of the algorithm

An illustration of the first few steps of the hierarchical clustering algorithm.

Interpreting a dendrogram

  • It may be tempting to think that observations that are close on the horizontal axis are similar to each other.
  • However, consider that branches can be freely rotated.
  • Therefore, the similarity between observations is solely determined by and inversely proportional to the height at which they fuse.
  • In the following figure, for example, observation 9 is no more similar to observation 2 than to observations 8, 5, or 7.

Exercise 7

Appendix

Additional considerations when working with imbalanced datasets

ROC curves and imbalanced datasets

  • When a ROC curve has a high AUC this is often presented as a sign of a good classifier.
  • However, in the case of imbalanced data (i.e. when the fraction of positive cases is small) a classifier with a high sensitivity and a small FPR can still be wrong most of the time it makes a positive prediction. The classifier has a low precision. If calling false positives is associated with a cost, this can be a problem, e.g. when people are wrongly treated.
  • The reason for this is that the PPV is prevalence-dependent and has to be calculated according to Bayes’ theorem. For example, if sensitivity and specificity are both 0.99 but the prevalence of positive cases is 0.01 then the PPV is calculated as

\[ PPV = P(\text{pos.} \mid \text{pos. test}) = \frac{P(pos) \cdot TPR}{P(pos) \cdot TPR + P(neg) \cdot FPR} \]

  • So the PPV (=chance of being right) is only 50%!
  • Check this online tool to interactively explore the phenomenon.
  • The reason for this has to do with Bayes’ theorem. The following two slides illustrate the idea.
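The worked example above is easy to reproduce in code. A small sketch (the `ppv` function name is illustrative) that plugs the numbers from the text into the formula:

```python
# PPV via Bayes' theorem: sensitivity = specificity = 0.99, prevalence = 0.01.

def ppv(prevalence, sensitivity, specificity):
    """P(truly positive | positive test), i.e. the precision."""
    tpr = sensitivity       # true positive rate
    fpr = 1 - specificity   # false positive rate
    return (prevalence * tpr) / (prevalence * tpr + (1 - prevalence) * fpr)

print(ppv(prevalence=0.01, sensitivity=0.99, specificity=0.99))  # → 0.5
```

Varying `prevalence` while keeping sensitivity and specificity fixed shows how strongly the PPV depends on the base rate.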

On prior and posterior probabilities I

On prior and posterior probabilities II

Conclusion

  • Base rates (class balance) matter.
  • In the case of imbalanced data it is therefore advisable to also show (or ask for) a precision-recall curve. See also this paper4.
  • However, note that the prior probability in your sample may be different from the whole population: if you perform a diagnostic test you usually already have a reason to test and so your prior probability may be different from the population.
  • Many other performance metrics can be constructed from a confusion matrix, and the same metric often goes by different names. This can make things very confusing; this Wikipedia article is a good reference.

Additional considerations during model tuning and cross-validations

Grouped (blocked) and stratified cross-validation

  • Sometimes not all observations in a data set are truly independent of each other but they form groups.
  • For example, there may be observations representing treatment with the same drug at different concentrations.
  • Naturally, such observations will have a high correlation with each other.
  • It is therefore important that members of a group are either in the training or the validation set, but not both. Otherwise your performance estimates may be far too optimistic.
  • Again, such scenarios are common in biology: batch effects (!), different concentrations, cell-type specific differences etc.
  • Another important thing to consider is stratification.
  • General rule: the distribution of the labels and features should be the same in the training, validation, and test sets.

Illustration of grouped k-fold cross-validation (left) and stratified k-fold cross-validation (right). If needed, both can be combined.
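The grouping rule can be sketched as a toy grouped k-fold split. This is an illustrative pure-Python version (the `grouped_kfold` function and the drug labels are made up, not a real library API); in practice you would use a ready-made implementation such as scikit-learn’s `GroupKFold`:

```python
# Toy grouped k-fold: all observations sharing a group label land in the
# same fold, so no group appears in both training and validation.

def grouped_kfold(groups, k):
    """Assign each unique group (round-robin) to one of k folds and
    return, per fold, the indices of the held-out observations."""
    unique = sorted(set(groups))
    fold_of_group = {g: i % k for i, g in enumerate(unique)}
    folds = [[] for _ in range(k)]
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return folds

# e.g. the same drug measured at several concentrations = one group
groups = ["drugA", "drugA", "drugB", "drugB", "drugC", "drugC"]
print(grouped_kfold(groups, k=3))  # → [[0, 1], [2, 3], [4, 5]]
```

Because both measurements of each drug are held out together, the model is always evaluated on drugs it has never seen during training.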

Failure to group: an example

Can you spot the mistake?

Failure to group: an example

Later, the team issued a correction (link):

Common mistakes

Machine learning is being adopted in many fields, often by people who lack experience with it. The following mistakes are, unfortunately, quite common and can lead to highly misleading results:

Serious mistakes:

  • Reporting the training error as the model performance5.
  • Performing filtering, feature selection, imputation steps on the complete data set instead of the training set. General rule: any procedure that uses the labels must only be performed on the training set!
  • Sometimes people train a model, evaluate its performance, and then look at which features were the most important. Next, they retrain the model on the original training set keeping only those features and evaluate it again on the already used test set6. If you do that, you need a fresh test set.
  • Not grouping correlated observations7. Not stratifying your data.
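The second point in this list (fitting preprocessing on the complete data set) is worth a small sketch. A minimal, illustrative example of the correct pattern, here with simple standardization (all names are made up; in practice a library pipeline, e.g. scikit-learn’s `Pipeline`, enforces this automatically):

```python
# Fit preprocessing on the training split only, then apply the
# *training* statistics to the test split.

def fit_scaler(train):
    """Estimate mean and standard deviation from training data only."""
    mean = sum(train) / len(train)
    var = sum((x - mean) ** 2 for x in train) / len(train)
    return mean, var ** 0.5

def transform(values, mean, sd):
    """Standardize values using previously fitted statistics."""
    return [(x - mean) / sd for x in values]

train, test = [1.0, 2.0, 3.0, 4.0], [2.5, 10.0]
mean, sd = fit_scaler(train)            # statistics from training data only
train_z = transform(train, mean, sd)
test_z = transform(test, mean, sd)      # the test set never influences the fit
print(round(mean, 2), round(sd, 3))     # → 2.5 1.118
```

The anti-pattern would be calling `fit_scaler(train + test)`: information from the test set then leaks into the preprocessing and the performance estimate becomes optimistic.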

Other mistakes/things to watch out for:

  • Reporting the validation error as the model performance. This is in fact often OK but not considered best practice; it depends on the specific case.
  • Watch out: some algorithms require your features to be rescaled or standardized.
  • Watch out: correlated features can be problematic, especially in regression approaches. Either remove them or merge correlated features into a new feature.

Footnotes

  1. Grinsztajn et al., 2022: Why do tree-based models still outperform deep learning on tabular data?↩︎

  2. This may sound like cheating but in practice often works surprisingly well.↩︎

  3. In fact, the choice of distance metric is very important when performing clustering. Popular distance metrics are for example the cosine distance or the correlation between observations.↩︎

  4. Saito and Rehmsmeier, 2015: The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets.↩︎

  5. If model performance seems too good to be true, it probably is. Ask if you see such cases in e.g. a talk!↩︎

  6. This is very tempting in biomarker or genomics studies. However, in high-dimensional data sets it is very likely to find some feature that just by chance correlates with the outcome variable.↩︎

  7. Watch out for this especially in studies with different drug concentrations.↩︎